ABSTRACT

Web resources are increasingly interactive, resulting in
resources that are increasingly difficult to archive. The archival
difficulty is based on the use of client-side technologies (e.g.,
JavaScript) to change the client-side state of a representation
after it has initially loaded. We refer to these representations
as deferred representations. We can better archive deferred
representations using tools like headless browsing clients. We use
10,000 seed Uniform Resource Identifiers (URIs) to explore the
impact of including PhantomJS - a headless browsing tool - in the
crawling process by comparing the performance of wget (the
baseline), PhantomJS, and Heritrix. Heritrix crawled 2.065 URIs
per second, 12.15 times faster than PhantomJS and 2.4 times faster
than wget. However, PhantomJS discovered 531,484 URIs, 1.75 times
more than Heritrix and 4.11 times more than wget. To take
advantage of the performance benefits of Heritrix and the URI
discovery of PhantomJS, we recommend a tiered crawling strategy in
which a classifier predicts whether a representation will be
deferred or not, and only resources with deferred representations
are crawled with PhantomJS while resources without deferred
representations are crawled with Heritrix. We show that this
approach is 5.2 times faster than using only PhantomJS and creates
a frontier (set of URIs to be crawled) 1.8 times larger than using
only Heritrix.
1. INTRODUCTION

The Web - by design and demand - continues to change. Today’s Web
users expect Web resources to provide application-like interactive
features, client-side state changes, and personalized
representations. These features enhance the browsing experience,
but make archiving the resulting representations difficult - if
not impossible. We refer to the ease of archiving a Web resource
as archivability [8].

Web resources are ephemeral by nature, making archives like the
Internet Archive valuable to Web users seeking to revisit prior
versions of the Web. Users (and robots) utilize archives in a
variety of ways. Live Web resources are more heavily leveraging
JavaScript (i.e., Ajax) to load embedded resources, which leads to
the live Web “leaking” into the archive [9] or missing embedded
resources in the archives, both of which ultimately result in
reduced archival quality [7].

We define deferred representations as representations of resources
that use JavaScript and other client-side technologies to load
embedded resources or fully construct a representation and,
therefore, have low archivability. Deferred refers to the final
representation that is not fully realized and constructed until
after the client loads the page and executes the client-side code.
The client renders the representation in the user-agent, and user
interactions and events then occur within the representation on
the client. The final representation is deferred until after the
user-agent, JavaScript, and user events complete their execution
on the resource. From this point forward, we will refer to
representations dependent upon these factors as deferred
representations.

Conventional Web crawlers (e.g., Heritrix, wget) are not equipped
with the necessary tools to execute JavaScript during the archival
process [6] and subsequently never dereference the URIs of the
resources that are embedded via JavaScript and are required to
complete the deferred representation. PhantomJS allows JavaScript
to execute on the client, rendering the representation as would a
Web browser. In the archives, the missing embedded resources
return a non-200 HTTP status (e.g., 404, 503) when their Uniform
Resource Identifiers (URIs) are dereferenced, leaving pages
incomplete. Deferred representations can also lead to zombies,
which occur when archived versions of pages inappropriately load
embedded resources from the live Web [9], leaving pages incorrect,
or more accurately, prima facie violative [2].

We investigate the impact of crawling deferred representations as
the first step in an improved archival framework that can replay
deferred representations both completely and correctly. We measure
the expected increase in frontier (list of URIs to be crawled)
size and wall-clock time required to archive resources, and
investigate a way to recognize deferred representations to
optimize crawler performance using a two-tiered approach that
combines PhantomJS and Heritrix. Our efforts measure the crawling
tradeoff between traditional archival tools and tools that can
better archive JavaScript with headless browsing - a tradeoff that
was anecdotally understood but not yet measured.

Throughout this paper we use Memento Framework terminology.
Memento is a framework that standardizes Web archive access and
terminology. Original (or live web) resources are identified by
URI-R, and archived versions of URI-Rs are called mementos and are
identified by URI-M.

2. RELATED WORK 

Archivability helps us understand what makes representations
easier or harder to archive. Banos et al. created an algorithm to
evaluate archival success based on adherence to standards for the
purpose of assigning an archivability score. In our previous work,
we studied the factors influencing archivability, including
accessibility standards and their impact on memento completeness,
demonstrating that deviation from accessibility standards leads to
reduced archivability. We also demonstrated the correlation
between the adoption of JavaScript and Ajax and the number of
missing embedded resources in the archives [8].

Spaniol measured the quality of Web archives based on matching
crawler strategies with resource change rates. Ben Saad and
Gançarski performed a similar study regarding the importance of
changes on a page [5]. Gray and Martin created a framework for
high quality mementos and assessed their quality by measuring the
missing embedded resources. In previous work, we measured the
relative damage caused to mementos that were missing embedded
resources to quantify the damage caused by missing resources
loaded by JavaScript [7]. These works study quality, helping us
understand what is missing from mementos.

David Rosenthal spoke about the difficulty of archiving
representations enabled by JavaScript. Google has made efforts
toward indexing deferred representations - a step in the direction
of solving the archival challenges posed by deferred
representations [6]. Google’s indexing focuses on rendering an
accurate representation for indexing and discovering new URIs, but
does not completely solve the challenges to archiving caused by
JavaScript. Archiving web resources and indexing representation
content are different activities that have differing goals and
processes.

Several efforts have studied client-side state. Mesbah et al.
performed several experiments regarding crawling and indexing
representations of Web pages that rely on JavaScript. These works
have focused mainly on search engine indexing and automatic
testing rather than archiving, but serve to illustrate the
pervasive problem of deferred representations. Dincturk et al.
constructed a model for crawling Rich Internet Applications (RIAs)
by discovering all possible client-side states and identifying the
simplest possible state machine to represent the states.


These prior works have focused on archival difficulties of 
crawling and indexing deferred representations, but have not 
explored the impact of archiving deferred representations on 
archival processes and crawlers. We measure the trade-off 
between speed and completeness of crawling techniques. 


3. BACKGROUND 

Web crawlers operate by starting with a finite set of seed 
URI-Rs in a frontier - or list of crawl targets - and add 
to the frontier by extracting URIs from the representations 
returned. Representations of Web resources are increasingly 
reliant on JavaScript and other client-side technologies to 
load embedded resources and control the activity on the 
client. Web browsers use a JavaScript engine to execute the
client-side code; Web crawlers traditionally do not have such
an engine or the ability to execute client-side code because
of the resulting loss of crawling speed. The client-side code
can be used to request additional data or resources from
servers (e.g., via Ajax) after the initial page load. Crawlers
are unable to discover the resources requested via Ajax and, 
therefore, are not adding these URIs to their frontiers. The 
crawlers are missing embedded resources, which ultimately 
causes the mementos to be incomplete. 

To mitigate the impact of JavaScript and Ajax on archivability,
traditional crawlers that do not execute JavaScript (e.g.,
Heritrix) have constructed approaches for extracting links from
embedded JavaScript to be added to crawl frontiers. Even though it
does not execute JavaScript, Heritrix v. 3.1.4 does peek into the
embedded JavaScript code to extract links where possible. These
processes rely on string matching and regular expressions to
recognize URIs mentioned in the JavaScript. This is a sub-optimal
approach because JavaScript may construct URIs from multiple
strings during execution, leading to an incomplete URI extracted
by the crawler.
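
The limitation of string matching can be illustrated with a minimal Python sketch. The script text, the pattern, and the `extract_uris` helper below are hypothetical and are not Heritrix's actual link extractor; they only show why a URI assembled from several strings at run time is seen as an incomplete fragment when the script is never executed.

```python
import re

# Hypothetical inline script: one URI is a literal, the other is
# assembled from several strings when the script runs.
SCRIPT = '''
var logo = "http://example.com/img/logo.png";
var host = "http://cdn.example.com";
var feed = host + "/data/" + "feed.json";
'''

# Simple pattern for absolute http(s) URIs inside quoted strings
# (an illustrative assumption, not a production-quality extractor).
URI_PATTERN = re.compile(r'["\'](https?://[^"\']+)["\']')


def extract_uris(script_text):
    """Return URIs found by string matching, without executing the script."""
    return URI_PATTERN.findall(script_text)


found = extract_uris(SCRIPT)
print(found)
# The literal URI and the bare host are found; the full feed URI that the
# script would request at run time never appears as a single string.
print("http://cdn.example.com/data/feed.json" in found)
```

A client that executes the script would instead observe the request for the fully constructed URI.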

Because archival crawlers do not execute JavaScript, what is
archived by automatic crawlers is increasingly different from what
users experience. A solution to this challenge of archiving
deferred representations is to provide crawlers with a JavaScript
engine and allow headless browsing (i.e., allow a crawler to
operate like a browser) using a technology such as PhantomJS.
However, this change in crawling method impacts crawler
performance, frontier size, and crawl time.


4. MOTIVATING EXAMPLES 

To illustrate the challenge of archiving resources with deferred
representations, we consider the resource at URI-R
http://www.truthinshredding.com/ and its mementos in Figure 1. We
took a PNG snapshot of the live-Web resource as rendered in
Mozilla Firefox (Figure 1(a)), the resource as loaded by PhantomJS
(Figure 1(b)), and the memento created by Heritrix and viewed in a
local installation of the Wayback Machine (Figure 1(c)). The title
of the page “Truth in Shredding” appears in a different font in
Figure 1(a) than in Figures 1(b) and 1(c), not due to a missing
style sheet but rather an incompatibility of the font with the
headless browser.


(a) The live resource at URI-R http://www.truthinshredding.com/
loads A, B, and C via JavaScript.

(b) Using PhantomJS, the advertisement (B) and video (C) are found
but the account frame (A) is missed.

(c) Using Heritrix, the embedded resources A, B, and C are missed.

Figure 1: Neither archival tool captures all embedded resources,
but PhantomJS discovers the URI-Rs of two out of three embedded
resources dependent upon JavaScript (B, C) while Heritrix misses
all of them.

The live-Web resource loads embedded resources (annotated as A, B,
and C) via JavaScript. Embedded Resource A is an HTML page loaded
into an iframe. The original resources are described in Table 1.

Embedded Resource A, after it loads into the iframe, uses
JavaScript to pull the profile image into the page from URI-R_A1.
Embedded Resource B is an advertisement that uses the JavaScript
at URI-R_B1 to pull in ads to the page. Embedded Resource C is a
YouTube video that is embedded in the page using the following
HTML for an iframe:

<iframe allowfullscreen="" frameborder="0" height="281" src="//www.youtube.com/embed/QyL14Fd4cGA?rel=0" width="500"></iframe>


PhantomJS does not load Embedded Resource A, potentially because
the host resource completes loading before the page embedded in
the iframe can finish loading. PhantomJS stops recording embedded
URIs and monitoring the representation after a page has completed
loading, and Embedded Resource A executes its JavaScript to load
the profile picture after the main representation has completed
the page load.¹ PhantomJS does discover the advertisement
(Embedded Resource B) and the YouTube video (Embedded Resource C).
Even though the headless browser used by PhantomJS does not have
the plugin necessary to display the video, the URI-R is still
discovered by PhantomJS.

Heritrix fails to identify the URI-Rs for the Embedded Resources
A, B, and C. When the memento created by Heritrix is loaded by the
Wayback Machine, Embedded Resources A, B, and C are missing. This
is attributed to Heritrix, which does not discover the URI-Rs for
these resources during the crawl. When viewing the memento through
the Wayback Machine, the JavaScript responsible for loading the
embedded resources is executed, resulting in either a zombie
resource (prima facie violative) or an HTTP 404 response
(incomplete) for the embedded URI.

¹ PhantomJS scripts can be written to avoid this race condition
using longer timeouts or client-side event detection, but this is
outside the scope of this paper.

Heritrix’s inability to discover the embedded URI-Rs could be
mitigated by utilizing PhantomJS during the crawl. However, this
raises many questions, most notably: How much slower will the
crawl be? How many additional embedded resources could it recover
and potentially need to store? Can we optimize the crawl approach
based on the detection of deferred representations? Our
investigation into these questions will assess the feasibility of
combining Heritrix with PhantomJS to balance the speed of Heritrix
with the completeness of PhantomJS.

5. COMPARING CRAWLS 

We designed an experiment to measure the performance differences
between a command-line archival tool (wget), a traditional crawler
(the Internet Archive’s Heritrix Crawler [23, 30]), and a headless
browser client (PhantomJS). Neither Heritrix nor wget executes the
client-side JavaScript, while PhantomJS does execute client-side
JavaScript.

We constructed a 10,000 URI-R dataset by randomly generating a
Bitly URI and extracting its redirection target (identical to the
process used to create the Bitly data subset). We split the 10,000
URI dataset into 20 sets of 500 seed URI-Rs and used wget,
Heritrix, and PhantomJS to crawl each set of seed URI-Rs. We
repeated each crawl ten times to establish an average performance,
resulting in ten different crawls of the 10,000 URI dataset
(executing the crawls one 500-URI set at a time) with wget,
Heritrix, and PhantomJS. We measured the increase in frontier size
(|F|) and the URIs per second (t_URI) to crawl the resource.

















































































URI ID      URI-R
URI-R_A     https://apis.google.com/u/0/_/widget/render/page?usegapi=1&rel=publisher&href=%2F%2Fplus.google.com%2F110743665890542265089&width=430&hl=en-GB&origin=http%3A%2F%2Fwww.truthinshredding.com&gsrc=3p&ic=1&jsh=m%3B%2F_%2Fscs%2Fapps-static...
URI-R_A1    https://apis.google.com/_/scs/apps-static/_/ss/k=oz.widget.-yiilzpp4csh.L.W.0/m=bdg/ain=AAAAAJAwAA4/d=l/rs=AItRSTNrapsz0r4y_tKMAlhZh6JM-glhaQ
URI-R_B1    http://pagead2.googlesyndication.com/pagead/show_ads.js
URI-R_C     http://www.youtube.com/embed/QyL14Fd4cGA?rel=0

Table 1: The URI-Rs in Figure 1.


While Heritrix provides a user interface that identifies the crawl
frontier size, PhantomJS and wget do not. We calculate the
frontier size of PhantomJS by counting the number of embedded
resources that PhantomJS requests when rendering the
representation. We calculate the frontier size of wget by
executing a command² that records the HTTP GET requests issued by
wget during the process of mirroring a web resource and its
embedded resources. We consider the frontier size to be the total
number of resources and embedded resources that wget attempts to
download.

We began a crawl of the same 500 URI-Rs using wget, Heritrix, and
PhantomJS simultaneously to mitigate the impact of live Web
resources changing state during the crawls. For example, if the
representation changes (such as by including new embedded
resources) between the times at which wget, PhantomJS, and
Heritrix perform their crawls, the number or representations of
embedded resources may change, and any difference in crawl
performance would then reflect the representation rather than the
crawler itself.

We crawled live-Web resources because mementos inherit the
limitations of the crawler used to create them. Depending on crawl
policies, a memento may be incomplete and different from the live
resource. The robots.txt protocol, breadth- versus depth-first
crawling, or the inability to crawl certain representations (like
deferred representations as we discuss in this paper) can all
influence the mementos created during a crawl.


Average Crawl Rate by Tool

Figure 2: Heritrix crawls 12.13 times faster than PhantomJS. The
error lines indicate the standard deviation across all ten runs.

5.1 Crawl Time by URI

To better understand how crawl times of wget, PhantomJS, and
Heritrix differ, we determined the time needed to execute a crawl.
Heritrix has a browser-based user interface that provides the
URIs/second (t_URI) metric. We collected this metric from the Web
interface for each crawl. We used Unix system times to calculate
the crawl time for each PhantomJS and wget crawl by determining
the start and stop times for dereferencing each resource and its
embedded resources. We compare the wget, PhantomJS, and Heritrix
crawl times per URI in Figure 2 and Table 2. Heritrix outperforms
PhantomJS, crawling 2.065 URIs/s while PhantomJS crawls 0.170
URIs/s and wget crawls 0.864 URIs/s. Heritrix crawls, on average,
12.13 times faster than PhantomJS and 2.39 times faster than wget.

The performance difference comes from two aspects of the crawl.
First, Heritrix executes crawls in parallel with multiple threads
being managed by the Heritrix software - this is not possible with
PhantomJS on a single-core machine since PhantomJS requires access
to a headless browser and its associated JavaScript engine, and
parallelization will result in process and threading conflicts.
Second, Heritrix does not execute the client-side JavaScript and
only adds URIs that are extracted from the Document Object Model
(DOM), embedded style sheets, and other resources to its frontier.

² We executed wget -T 40 -o outfile -p -O headerFile [URI-R],
which downloads the target URI-R and all embedded resources and
dumps the HTTP traffic to headerFile.


5.2 URI Discovery and Frontier Size 

We performed a string-matching de-duplication (that is, removing
duplicate URIs) to determine the true frontier size (|F|).


             Crawl time (URIs/s)                  Frontier size
Crawler      t_URI (mean)   s_tURI (std. dev.)    |F| (mean)   s_|F| (std. dev.)
wget         0.864          0.855                 129,443      3,213.65
Heritrix     2.065          0.137                 302,961      1,219.82
PhantomJS    0.170          0.001                 531,484      2,036.92

Table 2: Mean and standard deviation of crawl time (URIs/s) and
frontier size for wget, Heritrix, and PhantomJS crawls of 10,000
seed URIs.
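
The speed and frontier ratios quoted in the text can be rechecked from the means in Table 2. The short Python sketch below only redoes that arithmetic; the dictionary layout and the `ratio` helper are illustrative assumptions. Because it starts from the rounded table values, the Heritrix-to-PhantomJS speed ratio comes out near 12.15, slightly different from the 12.13 quoted in the text, which may have been computed from unrounded measurements.

```python
# Mean crawl rate (URIs/s) and mean frontier size, copied from Table 2.
TABLE_2 = {
    "wget": {"rate": 0.864, "frontier": 129_443},
    "Heritrix": {"rate": 2.065, "frontier": 302_961},
    "PhantomJS": {"rate": 0.170, "frontier": 531_484},
}


def ratio(metric, numerator, denominator):
    """Ratio of one tool's mean to another's for the given metric."""
    return TABLE_2[numerator][metric] / TABLE_2[denominator][metric]


comparisons = [
    ("Heritrix vs. PhantomJS speed", ratio("rate", "Heritrix", "PhantomJS")),
    ("Heritrix vs. wget speed", ratio("rate", "Heritrix", "wget")),
    ("PhantomJS vs. Heritrix frontier", ratio("frontier", "PhantomJS", "Heritrix")),
    ("PhantomJS vs. wget frontier", ratio("frontier", "PhantomJS", "wget")),
]
for label, value in comparisons:
    print(f"{label}: {value:.2f}")
```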

As shown in Figure 3 and in Table 2, we found that PhantomJS
discovered and added 1.75 times more URI-Rs to its frontier than
Heritrix, and 4.11 times more URI-Rs than wget. Per URI-R,
PhantomJS loads 19.7 more embedded resources than Heritrix and
32.4 more embedded resources than wget. The superior PhantomJS
frontier size is attributed to its ability to execute JavaScript
and discover URIs constructed and requested by the client-side
scripts.

Average Frontier Size by Tool

Figure 3: PhantomJS discovers 1.75 times more embedded resources
than Heritrix and 4.11 times more resources than wget. The error
lines indicate the standard deviation across all ten runs.


However, raw frontier size is not the only performance metric for
assessing the quality of the frontier. PhantomJS and Heritrix
discover some of the same URIs, while PhantomJS discovers URIs
that Heritrix does not and Heritrix discovers URIs that PhantomJS
does not. We measured the union and intersection of the Heritrix
and PhantomJS frontiers. As shown in Figure 4(a), per 10,000 URI-R
crawl, Heritrix finds 39,830 URI-Rs missed by PhantomJS on
average, while PhantomJS finds 194,818 URI-Rs missed by Heritrix
per crawl on average. PhantomJS and Heritrix find 63,550 URI-Rs in
common. The wget crawl resulted in a frontier of 24,589 URI-Rs,
which was a proper subset of both the Heritrix and PhantomJS
frontiers.
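
The union and intersection measurement under strict string matching maps directly onto set operations. The following Python sketch is a minimal illustration under that assumption; the `compare_frontiers` helper and the toy URIs are hypothetical and are not data from the experiment.

```python
def compare_frontiers(frontier_a, frontier_b):
    """Return the URIs unique to each frontier, shared by both, and in either."""
    a, b = set(frontier_a), set(frontier_b)
    return {"only_a": a - b, "only_b": b - a, "both": a & b, "union": a | b}


# Hypothetical toy frontiers (not data from the experiment).
heritrix = [
    "http://example.com/",
    "http://example.com/a.css",
    "http://ads.example.net/",
]
phantomjs = [
    "http://example.com/",
    "http://example.com/a.css",
    "http://ads.example.net/?session=1",
    "http://example.com/feed.json",
]

overlap = compare_frontiers(heritrix, phantomjs)
for name, uris in overlap.items():
    print(name, len(uris))
```

With strict string matching, the bare ad-server URI and its session-specific variant count as different URIs, which is the effect the trimming policies later in this section try to account for.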


This analysis shows that PhantomJS finds 19.70 more embedded
resources per URI than Heritrix (Figure 5). Heritrix runs 12.13
times faster than PhantomJS (Figure 6). Note that the red axes in
Figures 5 and 6 are unmeasured and only projections of the
measured trends, with the projections predicting the performance
as the seed list size grows.

5.3 Frontier Properties 

During the PhantomJS crawls, we observed that PhantomJS discovers
session-specific URI-Rs that Heritrix misses and Heritrix
discovers Top Level Domains (TLDs) that PhantomJS misses,
presumably from Heritrix’s inspection of JavaScript. For example:

http://dg.specificclick.net/?y=3&t=h&u=http%3A%2F%2Fmisscellania.blogspot.com%2Fstorage%2FTwitter-2.png...

from PhantomJS versus

http://dg.specificclick.net/

from Heritrix. The uniquely Heritrix URI-Rs are potentially the
base of a URI to be further built by JavaScript. Because PhantomJS
only discovers URIs for which the client issues HTTP requests,
this URI-R is not discovered by PhantomJS. To determine the nature
of the differences between the Heritrix and PhantomJS frontiers,
we analyzed the union and intersection between the URI-Rs in the
frontiers using different matching policies (Figure 4(b)).

(a) Unions and Intersections (String Matching). A portion of
Heritrix, PhantomJS, and wget frontiers overlap. PhantomJS and
Heritrix identify URIs that the others do not.

(b) Unions and Intersections (Fuzzy Matching). The frontier of
URI-Rs unique to PhantomJS shrinks when only considering the host
and path aspects (Base Policy for matching) of the URI-R.

Figure 4: Heritrix, PhantomJS, and wget frontiers as an Euler
Diagram. The overlap changes depending on how duplicate URIs are
identified.

Figure 5: Frontier size grows linearly with seed size.

Figure 6: Crawl speed is dependent upon frontier size.

During a crawl of 500 URI-Rs by PhantomJS, 19,022 URI-Rs were
added to the frontier for a total of 19,522 URI-Rs in the
frontier. We also captured the content body (the returned entity
received when dereferencing a URI-R from the frontier) and
recorded its MD5 hash value. We used the hash value to identify
duplicate representations during the crawl. To detect duplicate
URIs, we applied five matching policies to the frontier (Table 4).
In other words, we identify cases in which the URIs are different
but the content is the same, similar to the methods used by
Sigurdsson [31, 32].

The No Trim policy uses strict string matching of the URI-Rs to
detect duplicates. The Base Trim policy trims all parameters from
the URI. The Origin Trim policy eliminates all parameters and
associated values that reference a referring source, such as
origin, callback, domain, or referrer. These parameters are often
associated with a value including the top level domain of the
referring page. Frequent implementers include Google Analytics or
ad services.

The Session Trim policy eliminates all parameters and their
associated values that reference a session. For example,
parameters such as session, sessionid, token_id, etc. are all
removed from the URI-R before matching. These parameters are often
used by ad services or streaming media services to identify
browsing sessions for tracking and revenue generation purposes.


The HTTP Trim policy removes all parameters with values that
mention a URI. Ad services, JavaScript files, and other statistics
tracking services frequently utilize these parameters. Examples of
each trim policy are shown in Table 3.
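
The five trim policies can be sketched in Python as small URI-rewriting functions. This is an illustrative reading of the policy descriptions, not the authors' implementation: the parameter lists contain only the names mentioned in the text, the `mentions_uri` heuristic for the HTTP Trim policy is an assumption, and the helper names are hypothetical.

```python
from urllib.parse import unquote, urlsplit, urlunsplit

# Parameter names taken from the policy descriptions; a real crawl
# would likely need longer, tuned lists (assumption).
ORIGIN_PARAMS = {"origin", "callback", "domain", "referrer"}
SESSION_PARAMS = {"session", "sessionid", "token_id"}


def _filter_query(uri, keep):
    """Rebuild `uri`, keeping only query pairs for which keep(name, value) is true."""
    parts = urlsplit(uri)
    kept = []
    for pair in parts.query.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        if keep(name.lower(), value):
            kept.append(pair)
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "&".join(kept), ""))


def mentions_uri(value):
    """Heuristic (assumption): a value mentions a URI if it contains an http(s) scheme."""
    decoded = unquote(value).lower()
    return "http://" in decoded or "https://" in decoded


TRIM_POLICIES = {
    "No Trim": lambda uri: uri,
    "Base Trim": lambda uri: _filter_query(uri, lambda name, value: False),
    "Origin Trim": lambda uri: _filter_query(
        uri, lambda name, value: name not in ORIGIN_PARAMS),
    "Session Trim": lambda uri: _filter_query(
        uri, lambda name, value: name not in SESSION_PARAMS),
    "HTTP Trim": lambda uri: _filter_query(
        uri, lambda name, value: not mentions_uri(value)),
}

base = "http://example.com/folder/index.html"
print(TRIM_POLICIES["Session Trim"](base + "?param=value&sessionid=12345"))
print(TRIM_POLICIES["HTTP Trim"](base + "?param=value&httpParam=http://www.test.com/"))
```

Two URIs are then treated as duplicates under a policy when their trimmed strings are equal.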

We used the five trimming policies to detect duplicates in the
frontiers constructed by PhantomJS in one of the crawls of 500
URI-Rs. At the end of the crawl, PhantomJS had a frontier of
19,522 URI-Rs. Using the MD5 hash of the representations, we
determined that this set had 8,859 duplicate representations. With
the trimmed URI and the MD5 hash of the entity, we can compare the
identifiers and the returned entities for duplication.
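
Detecting duplicate representations by hashing entity bodies can be sketched as follows. The `find_duplicate_bodies` helper and the sample responses are hypothetical; the sketch only assumes that each dereferenced URI-R yields a byte string whose MD5 digest identifies its content.

```python
import hashlib
from collections import defaultdict


def find_duplicate_bodies(responses):
    """Group URIs by the MD5 hash of their entity body.

    `responses` maps each URI to the bytes returned when it was
    dereferenced. Only groups with more than one URI are returned.
    """
    by_hash = defaultdict(list)
    for uri, body in responses.items():
        by_hash[hashlib.md5(body).hexdigest()].append(uri)
    return {digest: uris for digest, uris in by_hash.items() if len(uris) > 1}


# Hypothetical responses: two different URIs return the same entity.
responses = {
    "http://example.com/ad?sessionid=1": b"<html>ad</html>",
    "http://example.com/ad?sessionid=2": b"<html>ad</html>",
    "http://example.com/page": b"<html>page</html>",
}

duplicates = find_duplicate_bodies(responses)
for digest, uris in duplicates.items():
    print(digest, uris)
```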


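\[
\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Number of Classifications}} \tag{1}
\]

\[
\text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2}
\]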
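
Equations 1 and 2 translate directly into code. The Python sketch below uses hypothetical function names and made-up counts purely to show the arithmetic; the zero-denominator guard in `f_measure` is an added convention, not part of the equations.

```python
def accuracy(true_positives, true_negatives, total):
    """Equation 1: correctly classified instances over all classifications."""
    return (true_positives + true_negatives) / total


def f_measure(precision, recall):
    """Equation 2: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only (not values from the paper).
print(accuracy(true_positives=40, true_negatives=30, total=100))
print(f_measure(precision=0.5, recall=1.0))
```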


For each of the 19,522 URIs in the frontier and their associated
entity hash values, we determined the trimmed URI string, the
duplications of URIs in the frontier, and the number of duplicate
URIs that also had a duplicate entity body (Table 4). We
calculated the accuracy (Equation 1)³ of each trim policy using
the number of URIs with the same entity hash and URI as a true
positive (TP), the number of URIs that had neither a duplicate URI
nor a duplicate entity body as a true negative (TN), and the set
of all positives and negatives (P + N) as the total number of URIs
(19,522).

³ Accuracy is defined as the number of correctly classified
instances divided by the test set size (Equation 1). F-Measure
extends accuracy to consider the harmonic mean of precision and
recall (Equation 2).

Trim Policy    Original URI-R                                                                     Trimmed URI-R
No Trim        http://example.com/folder/index.html?param=value                                   http://example.com/folder/index.html?param=value
Origin Trim    http://example.com/folder/index.html?callback=cs.odu.edu                           http://example.com/folder/index.html
Base Trim      http://example.com/folder/index.html?param=value                                   http://example.com/folder/index.html
Session Trim   http://example.com/folder/index.html?param=value&sessionid=12345                   http://example.com/folder/index.html?param=value
HTTP Trim      http://example.com/folder/index.html?param=value&httpParam=http://www.test.com/    http://example.com/folder/index.html?param=value

Table 3: Examples of the URI trim policies.

Trim Type      URI Duplicates   URI and Entity Duplicates   Accuracy
No Trim        6,469            4,684                       0.68
Origin Trim    7,078            4,749                       0.68
Base Trim      10,359           5,191                       0.56
Session Trim   8,159            4,921                       0.64
HTTP Trim      7,315            4,868                       0.67

Table 4: Detected duplicate URIs, entity bodies, and the overlap
between the two using the five URI string trimming policies.

The Origin Trim and No Trim policies had identical accuracy
ratings (0.68). The Base Trim policy identified the most URI
duplicates, and is used to determine the overlap between the
Heritrix and PhantomJS frontiers.

Using the Base Trim policy to only consider the host and path
(e.g., http://pubads.g.doubleclick.net/gampad/ads) of the
PhantomJS and Heritrix frontiers, PhantomJS identifies 376,578
URI-Rs added to the frontier, 199,761 (55%) of which are
duplicates of the discovered URIs. If we consider only the host
and path of the PhantomJS URIs, the Euler Diagram of the PhantomJS
and Heritrix frontiers is more evenly matched (Figure 4(b)).

5.4 Deferred vs. Non-Deferred Crawls 

To isolate the impact of resources with deferred representations
on crawl performance, we manually classified 200 URI-Rs from our
set of 10,000 URI-Rs as having deferred representations and
another 200 as having non-deferred representations. We crawled
each of the deferred and non-deferred sets of URI-Rs with
PhantomJS and Heritrix.

During the crawl of the non-deferred set, PhantomJS crawled
t_URI = 0.255 URIs/s, while Heritrix crawled 5.25 times faster
than PhantomJS. Heritrix uncovered 1,044 URI-Rs to add to the
frontier, while PhantomJS discovered 403 URI-Rs to add to the
frontier. This phenomenon of Heritrix having a larger frontier
than PhantomJS is due to Heritrix’s policy of looking into the
JavaScript files to extract URIs found in the code - the URI-Rs
discovered by Heritrix are top-level domains listed in the
JavaScript that may be used to construct URIs at run time (e.g.,
appending a username or timestamp to the URI) or not used by
JavaScript at all (e.g., a URI that exists in un-executed code).

During the crawl of the deferred set, Heritrix ran t_URI = 12.56
URIs/s, 25.12 times faster than PhantomJS. Heritrix added 3,206
URIs to the frontier, while PhantomJS added 3,436 URIs to the
frontier. PhantomJS adds more URIs to the frontier despite
Heritrix’s introspection on the JavaScript of each crawl target.
This result is due to PhantomJS’s execution of JavaScript on the
client.

We observe that the PhantomJS frontier outperforms the 
Heritrix frontier during the deferred crawl. Heritrix crawls 
URIs faster than PhantomJS on each of the deferred and 
non-deferred crawls, but far exceeds the speed of PhantomJS 
during the deferred crawl. 

6. CLASSIFYING REPRESENTATIONS 

In practice, archival crawlers such as Heritrix would need to
identify URI-Rs that have low archivability in real time. Heritrix
currently does not have such an automatic capability. Archive-It,
for example, uses a manually curated list of URIs that have
deferred representations and uses Umbra to crawl them.

The ability to determine the archivability of a resource will
allow Heritrix to assign the URI-R to either the faster,
traditional Heritrix crawler or the slower PhantomJS (or another
JavaScript-enabled crawler). By enabling this two-tiered approach
to crawling, the archival crawlers can achieve maximum performance
by utilizing the heavy-duty JavaScript-capable crawlers only for
those resources that need them. However, this approach requires
the ability to, in real time, recognize or predict a deferred
representation.
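
The two-tiered assignment can be sketched as a simple partition of the frontier. In the Python sketch below, `partition_frontier`, the `predicts_deferred` callback, and the lookup-based stand-in classifier are all hypothetical; the sketch only shows where a classifier's prediction would route each URI-R, not how the prediction is made.

```python
def partition_frontier(uris, predicts_deferred):
    """Split URI-Rs into (deferred, non_deferred) lists.

    `predicts_deferred` is any callable returning True when a URI-R is
    expected to have a deferred representation.
    """
    deferred, non_deferred = [], []
    for uri in uris:
        (deferred if predicts_deferred(uri) else non_deferred).append(uri)
    return deferred, non_deferred


# Hypothetical stand-in for a trained classifier: a fixed lookup.
KNOWN_DEFERRED = {"http://example.com/app"}


def toy_classifier(uri):
    return uri in KNOWN_DEFERRED


frontier = ["http://example.com/app", "http://example.com/about.html"]
for_phantomjs, for_heritrix = partition_frontier(frontier, toy_classifier)
print("PhantomJS tier:", for_phantomjs)
print("Heritrix tier:", for_heritrix)
```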

Even though our goal is to detect whether or not representations
are dependent on JavaScript, the simple presence of JavaScript is
not a sufficient indicator of a deferred representation. In our
set of URI-Rs, the resources with deferred representations had, on
average, 21.98 embedded script tags or files, while the resources
with non-deferred representations had 5.3 script tags or files. Of
those resources with deferred representations, 84.1% had at least
one script tag, while 49.5% of the non-deferred representations
had at least one script tag. Because of the ubiquity of JavaScript
in both deferred and non-deferred representations, we opted for a
more complex feature vector to represent the features of the
representations.

In an effort to predict whether or not a representation would be
deferred, we constructed a feature vector of DOM attributes and
features of the embedded resources. We used Weka [13] to classify
the resources on subsets of the feature vectors to gauge their
performance. We extracted the following feature vector:

1. Ads: Using a list of known advertisement domains,
we determined whether or not a representation would
load an ad based on DOM and JavaScript analysis.

2. Script Tags: We counted the number of script tags
with JavaScript, both in files and embedded code.

3. Interactive Elements: We counted the number of
DOM elements that have JavaScript events attached
to them (e.g., onclick, onload).

4. Ajax (in JavaScript): To estimate the number of
Ajax calls (e.g., $.get(), XMLHttpRequest) we counted
the number of occurrences of Ajax requests in the
embedded external and independent JavaScript files.

5. Ajax (in HTML): To estimate the number of Ajax
calls (e.g., $.get(), XMLHttpRequest) we counted the
number of occurrences of Ajax requests in script tags
embedded in the DOM.

6. DOM Modifications: We counted the number of
times JavaScript made a modification of the DOM
(e.g., via the appendChild() function) to account for
DOM modifications after the initial page load.

7. JavaScript Navigation: We counted the occurrences
of JavaScript redirection and other navigation
functions (e.g., window.location calls).

8. JavaScript Storage: We counted the number of
JavaScript references to storage elements on the client (e.g.,
cookies) as an indication of client-controlled state.

9. Found, Same Domain: Using PhantomJS, we counted
the number of embedded resources originating from
the URI-R’s top level domain (TLD) that were
successfully dereferenced (i.e., returned an HTTP 200).

10. Missed, Same Domain: Using PhantomJS, we counted
the number of embedded resources originating from
the URI-R’s TLD that were not successfully dereferenced
(i.e., returned an HTTP 400- or 500-class response).

11. Found, Different Domain: Using PhantomJS, we
counted the number of embedded resources originating
outside of the URI-R’s TLD that were successfully
dereferenced (i.e., returned an HTTP 200).

12. Missed, Different Domain: Using PhantomJS, we
counted the number of embedded resources originating
outside of the URI-R’s TLD that were not successfully
dereferenced (i.e., returned an HTTP 400- or 500-class
response).
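
As an illustration of how several of the DOM features (2, 3, and 5-8) might be counted, the following is a rough Python sketch using only the standard library. The regular expressions and the name `extract_dom_features` are assumptions made for this example; the exact matching rules used to build the feature vector are not specified here, and static pattern matching will miss calls that are constructed dynamically.

```python
import re
from html.parser import HTMLParser

# Illustrative patterns only (assumptions, not the rules used in the study).
AJAX_PATTERN = re.compile(r"\$\.(?:get|post|ajax)\s*\(|XMLHttpRequest")
DOM_MOD_PATTERN = re.compile(r"\.appendChild\s*\(|\.innerHTML\s*=|document\.write\s*\(")
NAV_PATTERN = re.compile(r"window\.location|location\.href")
STORAGE_PATTERN = re.compile(r"document\.cookie|localStorage|sessionStorage")


class DomFeatureParser(HTMLParser):
    """Counts script tags and interactive elements; collects inline JavaScript."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.interactive_elements = 0
        self.inline_js = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
            self._in_script = True
        # Attributes beginning with "on" (onclick, onload, ...) are treated
        # as event handlers; this is an approximation.
        if any(name.startswith("on") for name, _ in attrs):
            self.interactive_elements += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.inline_js.append(data)


def extract_dom_features(html):
    """Return approximate counts for a subset of the DOM features."""
    parser = DomFeatureParser()
    parser.feed(html)
    parser.close()
    js = "\n".join(parser.inline_js)
    return {
        "script_tags": parser.script_tags,
        "interactive_elements": parser.interactive_elements,
        "ajax_in_html": len(AJAX_PATTERN.findall(js)),
        "dom_modifications": len(DOM_MOD_PATTERN.findall(js)),
        "js_navigation": len(NAV_PATTERN.findall(js)),
        "js_storage": len(STORAGE_PATTERN.findall(js)),
    }
```

Features 9-12 are not covered by this sketch because they require dereferencing the embedded resources and observing their HTTP status codes.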

We manually sampled 440 URI-Rs (from our collection of
10,000, including the same 400 from Section [A4|) and
classified the representations as deferred or non-deferred, with 200
training and 20 test URI-Rs for each based on whether or
not their representations were dependent upon JavaScript.


Actual            Predicted Classification
Classification    Deferred    Non-Deferred
Deferred          182         38
Non-Deferred      58          166

Table 5: Confusion matrix for the entire feature vector
(F-Measure = 0.791).


Actual            Predicted Classification
Classification    Deferred    Non-Deferred
Deferred          179         41
Non-Deferred      47          173

Table 6: Confusion matrix for the resource features
(features 9-12 of the vector; F-Measure = 0.844).

Using PhantomJS, we collected the 12 features required for
a feature vector for each of our 440 URI-Rs. Using Weka, we
ran each classifier on the feature vectors. Rotation Forests
performed the best of any of the standard Weka classifiers
for any of our datasets.

We used three subsets of the feature vector to investigate
the best method of predicting deferred representations. We
selected attributes 1-8 to represent DOM features. We
selected attributes 9-12 as embedded resource attributes (the
attributes we extract if we load and monitor the embedded
resources). Together, attributes 1-12 make up the entire
dataset. We use the feature sets to train and test our
classifier via 10-fold cross validation. We use the same three data
subsets and provide a confusion matrix of each set, including
the entire feature vector (Table 5), resource feature vector
(Table 6), and DOM feature vector (Table 7).
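
For readers unfamiliar with 10-fold cross validation, the splitting step can be sketched in a few lines of Python. This is a generic illustration and not Weka's implementation; the function name `k_fold_splits` and the fixed shuffle seed are assumptions made for the example.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 440 feature vectors and k = 10, each fold holds 44 vectors for testing while the remaining 396 are used for training, and every vector is tested exactly once.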

The accompanying statistics for the classifications are shown
in Table 8. With only the DOM features, the representations in
the test set are accurately classified as deferred or non-deferred
79% of the time. If we combine the DOM and resource
feature sets to create the full feature set, we can correctly
classify representations 81% of the time.
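
As a worked check, the accuracy for the DOM features can be recomputed from the confusion matrix reported for that feature set (Table 7). The sketch below assumes that rows are actual classes and columns are predicted classes, and treats "deferred" as the positive class. The precision and recall it returns follow the standard per-class definitions; the summary statistics produced by Weka may be averaged differently, so they are not expected to match every reported value.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, precision, and recall for the positive class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# DOM-feature confusion matrix: 168 deferred classified as deferred,
# 52 deferred classified as non-deferred, 41 non-deferred classified as
# deferred, and 179 non-deferred classified as non-deferred.
accuracy, precision, recall = confusion_metrics(tp=168, fn=52, fp=41, tn=179)
print(f"accuracy={accuracy:.2f}")
```

The accuracy works out to 347 correct classifications out of 440, which rounds to the 79% quoted for the DOM-only feature set.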


After a URI is dereferenced and a representation is returned,
we can determine whether or not the representation is
deferred with 79% accuracy. If we also dereference the URIs
for the embedded resources and monitor the HTTP status
codes, we can increase the accuracy of the prediction, albeit
minimally, to 81%. However, crawling with
PhantomJS is much more expensive when executed properly.
Due to this minimal improvement and the much higher cost of
measurement, the feature extraction will be limited to the DOM
classification. With a negligible impact on performance, our
classifier is able to identify deferred representations using
the DOM crawled by Heritrix with 79% accuracy.


Actual            Predicted Classification
Classification    Deferred    Non-Deferred
Deferred          168         52
Non-Deferred      41          179

Table 7: Confusion matrix for the DOM features
(features 1-8 of the vector; F-Measure = 0.806).


Features                   Classification            Accuracy  F-measure  Precision  Recall
DOM Features Only          Deferred, Non-deferred    79%  79%  78%  76%  81%  80%
DOM & Resource Features    Deferred, Non-deferred    81%  82%  79%  90%  81%  80%

Table 8: Classification success statistics for DOM-only and DOM and Resource feature sets.


7. TWO-TIERED CRAWLING

To benefit from the increased crawl frontier size of PhantomJS
while maintaining the performance of Heritrix, we
propose a tiered crawling approach in which PhantomJS is
used to crawl only resources with deferred representations.
A tiered approach to crawling would allow an archive to
simultaneously benefit from the frontier size of PhantomJS
and the speed of Heritrix. Table 9 provides a summary of the
extrapolated crawl speed and discovered frontier size of each
crawler. While the test environment used a single system,
a production environment should expect to see performance
improvements with additional resources. PhantomJS crawls
are not run in parallel, and additional nodes for PhantomJS
threads will further improve performance.


Crawl Strategy             Crawl Time    Crawl Rate     Frontier Size
                           (hrs)         (URIs/sec)     (|F|)
wget                       416.16        0.864          129,443
Heritrix                   407.53        2.065          302,961
PhantomJS                  8,684.38      0.170          531,484
Heritrix + PhantomJS       9,100.54      0.152          537,609
Heritrix + PhantomJS       6,495.23      0.196          458,815
with Classifier

Table 9: A summary of extrapolated performance
(based on our calculations) of single- and two-tiered
crawling approaches.

We have described the operation of crawls with wget, Heritrix,
and PhantomJS in Sections 5.1 and 5.4, with wget serving
as a baseline to which Heritrix and PhantomJS can be
compared; wget is not part of the archival workflow we
investigate. To reiterate, Heritrix crawls much more quickly
than PhantomJS, while PhantomJS discovers many more
embedded resources required to properly construct a
representation. Optimally during a crawl, Heritrix would dereference
a URI-R and run the resulting DOM through the classifier
to determine whether or not the representation will be
deferred (with 79% accuracy, as discussed in the preceding
section). If the representation is predicted to be deferred,
PhantomJS should also be used to crawl the URI-R and add the
newly discovered URI-Rs to the Heritrix frontier.

Heritrix should be used to crawl all URI-Rs in the frontier
because the DOM is required to classify a representation as
deferred. Since Heritrix is the fastest crawler, it should be
used to dereference the URI-Rs in the frontier and retrieve
the DOM of the resource for classification. Subsequently,
only if the representation is classified as deferred will
PhantomJS be used to crawl the resource to ensure the maximum
amount of embedded resources are retrieved.


In a naive two-tiered crawl strategy that will discover the
most embedded URI-Rs and create the largest frontier,
Heritrix and PhantomJS should both crawl each URI-R regardless
of whether the representation can be classified as
deferred or non-deferred. This creates a crawl that is expected
to be 13.5 times slower than simply using Heritrix, but is
expected to discover 1.77 times more URI-Rs than using only
Heritrix. This would ensure that 100% of all resources with
deferred representations would be crawled with both
Heritrix and PhantomJS. However, we want to limit the use of
PhantomJS to minimize the performance impacts it has on
the crawl speed.

If we include the classifier to predict when PhantomJS should
be used or when Heritrix will be a suitable tool, the
two-tiered approach is expected to run 10.5 times slower and
is expected to discover 1.5 times more URI-Rs than only
Heritrix. This crawl policy balances the trade-offs between
speed and larger frontier size by using the classifier to
indicate when to use PhantomJS to crawl resources with
deferred representations.
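
The expected slowdown and frontier growth quoted above can be approximately reproduced from the extrapolated values in Table 9. The short Python sketch below assumes that "times slower" is the ratio of crawl rates (URIs per second) rather than of total crawl hours; the dictionary keys and the function name are labels invented for this example.

```python
# Extrapolated crawl rates (URIs per second) and frontier sizes from Table 9.
CRAWL_RATE = {"heritrix": 2.065, "both": 0.152, "classifier": 0.196}
FRONTIER = {"heritrix": 302_961, "both": 537_609, "classifier": 458_815}


def relative_to_heritrix(strategy):
    """Return (slowdown, frontier growth) of a strategy versus Heritrix alone."""
    slowdown = CRAWL_RATE["heritrix"] / CRAWL_RATE[strategy]
    frontier_growth = FRONTIER[strategy] / FRONTIER["heritrix"]
    return slowdown, frontier_growth


for name in ("both", "classifier"):
    slowdown, growth = relative_to_heritrix(name)
    print(f"{name}: {slowdown:.2f}x slower, {growth:.2f}x frontier")
```

Under that assumption the ratios land close to the 13.5 and 10.5 slowdowns and the 1.77 and 1.5 frontier factors given in the text, with small differences attributable to rounding.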


To validate these expected values, we classified our 10,000
URI-R dataset, which produced 5,187 URI-Rs classified as
having deferred representations and 4,813 as having
non-deferred representations. We used PhantomJS to crawl the
URI-Rs classified as deferred, and only Heritrix to crawl the
URI-Rs classified as non-deferred. The results of the crawls
are detailed in Table 10.


Crawler     URI-R Set       Seed Size    Frontier Size    Crawl Time (hrs)
P           Deferred        5,187        311,903          84.9
H           Non-deferred    4,813        124,728          23.6
H           Deferred        5,187        171,499          26.7
P           All URI-Rs      10,000       438,388          686
H           All URI-Rs      10,000       275,234          48.3
Two-tier    All URI-Rs      10,000       399,202          133

Table 10: A simulated two-tiered crawl showing that
the frontier sizes can be optimized while mitigating
the performance impact of PhantomJS’s (P) crawl
speed vs Heritrix’s (H).

In Table 10, we show that PhantomJS creates a frontier
of 438,388, 1.6 times larger than that of Heritrix. However,
PhantomJS crawls 14 times slower than Heritrix. If
we perform a tiered crawl in which PhantomJS is responsible
for crawling only deferred representations, we can crawl
5.2 times faster than using only PhantomJS (but 2.7 times
slower than the Heritrix-only approach) while creating a
frontier 1.8 times larger than using only Heritrix. As a
result, we can maximize the frontier size, mitigate the impacts
of JavaScript on crawling, and mitigate the impact of the
reduced crawl speeds when using a tiered crawling approach.
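
The crawl-time comparisons can be checked against Table 10 with a few lines of Python. One reading that is consistent with the table, and which is an assumption of this sketch rather than a statement from the text, is that the two-tier crawl time equals the Heritrix time over all URI-Rs plus the PhantomJS time over the URI-Rs classified as deferred.

```python
# Crawl times in hours, taken from Table 10.
HOURS = {
    "phantomjs_deferred": 84.9,
    "heritrix_all": 48.3,
    "phantomjs_all": 686.0,
    "two_tier": 133.0,
}

# Assumed composition of the two-tier crawl time (see lead-in).
two_tier_estimate = HOURS["heritrix_all"] + HOURS["phantomjs_deferred"]

speedup_vs_phantomjs = HOURS["phantomjs_all"] / HOURS["two_tier"]
slowdown_vs_heritrix = HOURS["two_tier"] / HOURS["heritrix_all"]
phantomjs_vs_heritrix = HOURS["phantomjs_all"] / HOURS["heritrix_all"]

print(f"estimated two-tier time: {two_tier_estimate:.1f} hrs")
print(f"two-tier vs PhantomJS only: {speedup_vs_phantomjs:.2f}x faster")
print(f"two-tier vs Heritrix only: {slowdown_vs_heritrix:.2f}x slower")
print(f"PhantomJS vs Heritrix: {phantomjs_vs_heritrix:.2f}x slower")
```

The estimated two-tier time agrees with the 133 hours reported in the table to within rounding, and the ratios are in line with the factors of roughly 5.2, 2.7, and 14 quoted in the text.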

8. CONCLUSIONS 

In this paper, we measured the differences in crawl speed
and frontier size of wget, PhantomJS, and Heritrix. While
PhantomJS was the slowest crawler, it provided the largest
crawl frontier due to its ability to execute client-side JavaScript
to discover URIs missed by Heritrix and wget. Heritrix was
the fastest crawler. We also proposed a tiered approach to
crawling in which a classifier determines whether to crawl a
resource with PhantomJS to reap the URI discovery benefits
of the specialized crawler where appropriate.

This work lays the foundation for a two-tiered crawling
approach and helps predict the performance of future archival
workflows. We know that PhantomJS finds 19.70 more
embedded resources per URI and Heritrix runs 12.13 times
faster than PhantomJS, meaning the crawler should avoid
using PhantomJS on URI-Rs with non-deferred representations
to maintain an optimal performance trade-off. We understand that
PhantomJS is required to discover the embedded resources
needed to complete a deferred representation that Heritrix
cannot discover. This comes at a cost in run time, but offers
the benefit of more complete mementos and
a larger frontier for crawling. We also found that 53% of
URIs discovered by PhantomJS are duplicates if we remove
session-specific URI parameters.
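
To illustrate the kind of de-duplication referred to above, the following Python sketch strips session-style query parameters before comparing URIs. The parameter names in `SESSION_PARAMS` are hypothetical examples chosen for illustration; the actual set of session-specific parameters removed in this work is not listed here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session-style parameter names (an assumption for this sketch).
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def strip_session_params(uri):
    """Return the URI with session-style query parameters removed."""
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def duplicate_fraction(uris):
    """Fraction of URIs that collapse onto an already-seen canonical form."""
    if not uris:
        return 0.0
    canonical = {strip_session_params(uri) for uri in uris}
    return 1 - len(canonical) / len(uris)
```

Applied to a crawl frontier, `duplicate_fraction` gives the share of discovered URIs that differ only in their session parameters.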

Using DOM features we can accurately predict deferred and 
non-deferred representations 79% of the time. Using this 
classification, deferred representations can be crawled by 
PhantomJS to ensure all embedded resources are added to 
the crawl frontier. 

By using a multi-tiered approach to crawling, archives can
leverage the benefits of PhantomJS and Heritrix simultaneously.
That is, using a deferred representation classifier,
archives can use PhantomJS for deferred representations
and Heritrix for non-deferred representations. Using a
tiered crawling approach, we showed that crawls will run 5.2
times faster than using only PhantomJS and create a frontier 1.8
times larger than using only Heritrix. This crawl strategy
mitigates the impact of JavaScript on archiving while also
mitigating the reduced crawl speed of PhantomJS.

Our future work will include a framework for archiving
deferred representations, along with a measurement of the
archival improvement when implementing a deferred
representation crawler.